ggplot2 is an R package for producing visualizations of
data. Unlike many graphics packages, ggplot2 uses a conceptual framework
which allows you to create a graph from composable elements, instead of
being limited to a predefined set of charts. There are many great
resources online (for example The R
Graphics Cookbook, R
for Data Science, or R
Cheat Sheet).
The goal of this session is to familiarize you with the basic concepts and graphs. We are going to use the Student Survey Data to create example graphs and produce some ggplot2 code that you can build on for your own assignments.
If you later want to go further into this topic and you are looking
for innovation, look at the rich list of ggplot2 extensions (it is worth
it). A community-kept list can be found here. We will use
the following extensions today: ggthemes,
ggpubr, ggpie, treemapify, and
waffle.
library(tidyverse) # Loads ggplot2, dplyr, tidyr and the pipe operator %>%
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read.csv("../00_data/survey/survey_processed.csv") # Read in student survey
# Drop all rows with missing observations in einkommen, fleisch
# or wegdauer, and convert organ_donor and female to factors
data <- data %>%
drop_na(einkommen, fleisch, wegdauer) %>%
mutate(organ_donor = as.factor(organ_donor),
female = as.factor(female))
A ggplot graph consists of multiple components. By layering these components, users can create customized visualizations.
Out of these components, ggplot2 needs at least the following three to produce a chart: data, a mapping, and a geometry.
ggplot2 uses data to construct a plot. The system works best if the
data is provided in a tidy format. As the first step in many plots, we
pass the data to the ggplot() function, which stores the
data to be used later by other parts of the plotting system. Using the
Student Survey data, we start as follows:
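The data step on its own could be sketched as follows. The data frame `toy` is a hypothetical stand-in with made-up values (the real survey file is not reproduced here); only the column names are taken from the survey.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; the values are made up.
toy <- data.frame(
  einkommen = c(450, 800, 1200, 600, 950),
  wegdauer  = c(15, 30, 45, 20, 60)
)

# Passing only the data creates a plot object without mapping or layers.
# Printing it shows an empty panel.
p_data <- ggplot(toy)
p_data
```

The empty panel is expected at this stage: the data is stored, but nothing tells ggplot2 yet which columns to draw or how.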
The aesthetics or mapping of a plot is a set of instructions on how
parts of the data are mapped onto aesthetic attributes of geometric
objects. A mapping is created with the aes() function,
which pairs graphical attributes with columns of the data. If we want the
einkommen and wegdauer columns to map to the
x- and y-coordinates in the plot, we can do that as follows:
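A minimal sketch of this mapping step, again using a hypothetical `toy` data frame with made-up values in place of the survey data:

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; the values are made up.
toy <- data.frame(
  einkommen = c(450, 800, 1200, 600, 950),
  wegdauer  = c(15, 30, 45, 20, 60)
)

# aes() pairs the x and y aesthetics with two columns of the data.
# The axes now appear, but nothing is drawn because no geometry was added.
p_map <- ggplot(toy, aes(x = einkommen, y = wegdauer))
p_map
```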
The heart of any graphic is the geometry component. It takes the mapped data and displays it in some way we understand as a representation of the data. The geometry you choose depends on what you want to show:
There are many more options. Check out this link for a more complete list.
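To illustrate how the choice of geometry changes the picture, here is a small sketch that reuses one mapping with different geoms. The data frame `toy` and its values are assumptions made for illustration; the geoms shown are only a few of those available in ggplot2.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; the values are made up.
toy <- data.frame(
  einkommen = c(450, 800, 1200, 600, 950),
  wegdauer  = c(15, 30, 45, 20, 60)
)

# One shared mapping ...
base <- ggplot(toy, aes(x = einkommen, y = wegdauer))

# ... displayed with different geometries.
p_points <- base + geom_point()   # one point per observation
p_line   <- base + geom_line()    # observations connected in order of x

# A histogram needs only an x aesthetic; the counts are computed for us.
p_hist <- ggplot(toy, aes(x = einkommen)) + geom_histogram(bins = 5)

p_points
```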
Point plots are a versatile and effective tool for visualizing relationships between two continuous variables. They are particularly useful for identifying trends, clusters, and outliers in data.
# Creates a basic point plot and adds a line layer.
ggplot(data, aes(x = einkommen, y = paar_schuhe))+
geom_point()+
geom_line(colour = "blue", alpha = 0.3)
# Creates a point plot that uses two different colors for males and females.
ggplot(data, aes(x = einkommen, y = paar_schuhe))+
geom_point(aes(color = female))
# Adds a fitted line for the overall relationship between x and y,
# plus one fitted line per gender.
ggplot(data, aes(x = einkommen, y = paar_schuhe))+
geom_point(aes(colour = female))+
geom_smooth(method = "lm", se = TRUE)+
geom_smooth(method = "lm", se = TRUE, aes(color = female))
Geometries can have their own data and their own aesthetics, and some compute summary statistics of the data (e.g. histograms).
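The following sketch illustrates these three points under stated assumptions: `toy` and `high_income` are hypothetical data frames with made-up values, and the income threshold of 1000 is arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; the values are made up.
toy <- data.frame(
  einkommen   = c(450, 800, 1200, 600, 950, 1500),
  paar_schuhe = c(5, 8, 12, 6, 10, 20),
  female      = factor(c(1, 0, 1, 0, 1, 0))
)

# A subset used as the data of a single layer (arbitrary threshold).
high_income <- subset(toy, einkommen > 1000)

p_layers <- ggplot(toy, aes(x = einkommen, y = paar_schuhe)) +
  geom_point(aes(colour = female)) +                    # layer with its own aesthetic
  geom_point(data = high_income, shape = 1, size = 5)   # layer with its own data

# geom_histogram() computes a summary statistic (bin counts) before drawing.
p_hist <- ggplot(toy, aes(x = einkommen)) + geom_histogram(bins = 4)

# layer_data() returns the values the layer computed, including the counts.
hist_counts <- layer_data(p_hist)$count

p_layers
```

The second point layer circles only the rows in `high_income`, while the first layer still draws every observation of `toy`.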
We can also use the piping operator %>% in
combination with ggplot graphs!
Box plots are an essential tool for visualizing the distribution of a continuous variable. They provide a clear summary of data by displaying key statistics, such as the median, quartiles, and potential outliers.
# Creates a basic boxplot of einkommen by gender and overlays the individual observations.
data %>%
ggplot(aes(x = female, y = einkommen))+
geom_boxplot()+
geom_point(aes(size = fachsemester, shape = organ_donor), alpha = 0.2)
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
Bar plots are an excellent tool for visualizing the distribution of categorical data or comparing values across different categories. They represent data with rectangular bars, where the height (or length) of each bar corresponds to the frequency or value of a category.
# Creates a basic barplot for organ_donor.
data %>%
filter(!is.na(organ_donor)) %>%
ggplot(aes(x = organ_donor))+
geom_bar()
# Adds different colors for males and females and adds a legend on the side.
data %>%
ggplot(aes(x = organ_donor, fill = female))+
geom_bar()
# Plots the bars side by side.
data %>%
ggplot(aes(x = organ_donor, fill = female))+
geom_bar(position = "dodge")
Saving a ggplot as an R object.
# Saves the point plot as an R object under the name 'p'.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")
p
Saving a ggplot in JPEG or PNG format.
# Uses the png() function to open a graphics device that writes to the specified path.
# Note that it is important to call 'print(p)' explicitly between png() and
# dev.off(); without it (for example when the script is run with source()),
# the saved file can end up empty.
# Using a predefined size for the plot ensures that the size doesn't change
# between sessions.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")
png("results/fig_einkommen_wegdauer.png", width = 801, height = 456)
print(p)
dev.off()
## quartz_off_screen
## 2
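As an alternative to opening and closing a graphics device by hand, ggplot2 ships the helper ggsave(). This is a different technique from the png() approach above, shown here only as a sketch: the data frame `toy`, the temporary output path and the chosen size are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; the values are made up.
toy <- data.frame(
  einkommen = c(450, 800, 1200, 600, 950),
  wegdauer  = c(15, 30, 45, 20, 60)
)

p_save <- ggplot(toy, aes(x = einkommen, y = wegdauer)) +
  geom_point()

# Write to a temporary folder so the sketch does not touch the project.
out_file <- file.path(tempdir(), "fig_einkommen_wegdauer.png")

# ggsave() picks the device from the file extension and needs no print() call.
# Fixing width, height and dpi keeps the output size stable between sessions.
ggsave(filename = out_file, plot = p_save,
       width = 8, height = 4.5, units = "in", dpi = 100)
```

Passing the plot explicitly through `plot =` avoids relying on the last plot that happened to be displayed.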
Facets can be used to separate small multiples, or different subsets of the data. It is a powerful tool to quickly split up the data into smaller panels, based on one or more variables, to display patterns or trends (or the lack thereof) within the subsets.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")
# Divides the plot into two panels, one for females and one for males.
p <- p + facet_wrap(~female)
p
# Additionally, splits the plot into a grid of panels based on the sidejob
# (whether individuals have a side job) and female (gender) variables,
# creating a separate panel for each combination of these variables.
g <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_grid(sidejob~female)
g
Since the panel labels no longer tell us which subgroup is which, we should relabel the facets to make the names clearer.
female.labs <- c("Female", "Male")
names(female.labs) <- c("1", "0")
sidejob.labs <- c("Side job", "No side job")
names(sidejob.labs) <- c("1", "0")
g <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_grid(sidejob~female,
labeller = labeller(female = female.labs,
sidejob = sidejob.labs))
g
You can view the coordinates part of the plot as an interpreter of position aesthetics. While Cartesian coordinates are typically used, the coordinate system also powers the display of map projections and polar plots.
# Flips the bar plot sideways
data %>%
ggplot(aes(x = organ_donor, fill = female))+
geom_bar()+
coord_flip()
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_wrap(~female)
p
## `geom_smooth()` using formula = 'y ~ x'
# Edits the title, axis titles, etc.
p <- p + labs(x = "Monthly income",
y = "Commuting distance (min.)",
title = "Commute by income and gender",
subtitle = " ",
caption = "Source: Student survey 2023 and 2024",
color = "Organ donor")
p
## `geom_smooth()` using formula = 'y ~ x'
Scales are important for translating what is shown on the graph back to an understanding of the data. The scales typically form pairs with aesthetic attributes of the plots, and are represented in plots by guides, like axes or legends. Scales are responsible for updating the limits of a plot, setting the breaks, formatting the labels, and possibly applying a transformation.
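The four jobs of a scale named above (limits, breaks, labels, transformation) can be sketched in one small plot. The data frame `toy`, the break positions and the label format are assumptions chosen for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; the values are made up.
toy <- data.frame(
  einkommen = c(450, 800, 1200, 600, 950),
  wegdauer  = c(15, 30, 45, 20, 60)
)

# Break positions for the x-axis (arbitrary choice for this sketch).
breaks_x <- seq(0, 2000, by = 500)

p_scaled <- ggplot(toy, aes(x = einkommen, y = wegdauer)) +
  geom_point() +
  scale_x_continuous(
    limits = c(0, 2000),                            # sets the limits
    breaks = breaks_x,                              # sets the breaks
    labels = function(x) format(x, big.mark = ",")  # formats the labels
  ) +
  scale_y_log10()                                   # applies a transformation

p_scaled
```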
RColorBrewer is a very nice package that offers a wide range of color palettes.
# Specifies the Set1 color palette from RColorBrewer to define the
# colors for the organ_donor variable.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_grid(sidejob~female)+
labs(x = "Monthly income",
y = "Commuting distance (min.)",
title = "Commuting distance and monthly income by gender",
subtitle = " ",
caption = "Source: Student survey 2023 and 2024",
color = "Organ donor")+
scale_colour_brewer(palette = "Set1")
p
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
Every color can be identified by a hex code. If you want to define your color palette manually, you can specify the colors like this.
# Defines a vector containing the hex color codes.
cols <- c("#7570b3","#66a61e")
# Creates a plot using the predefined colors.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_grid(sidejob~female)+
labs(x = "Monthly income",
y = "Commuting distance (min.)",
title = "Commuting distance and monthly income by gender",
subtitle = " ",
caption = "Source: Student survey 2023 and 2024",
color = "Organ donor")+
scale_colour_manual(values = cols,
labels = c("1" = "Organ donor","0" = " No organ donor", "NA" = "Not stated"),
name = " ")
p
A great website to browse for colors and color palettes: link.
# Defines a vector containing the hex color codes.
cols <- c("#7570b3","#66a61e")
# Creates a plot using the predefined colors.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_grid(sidejob~female)+
labs(x = "Monthly income",
y = "Commuting distance (min.)",
title = "Commuting distance and monthly income by gender",
subtitle = " ",
caption = "Source: Student survey 2023 and 2024",
color = "Organ donor")+
scale_colour_manual(values = cols, labels = c("No organ donor", "Organ donor", "Not stated"),
name = " ")+
scale_x_continuous(limits = c(0,2000), breaks = c(0,250, 500, 750, 2000))
p
## Warning: Removed 15 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 15 rows containing missing values or values outside the scale range
## (`geom_point()`).
The theme system controls almost any visuals of the plot that are not controlled by the data. You can use the theme for customizations ranging from changing the location of the legends to setting the background color of the plot.
# Play around with the themes and see how the appearance changes.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_grid(sidejob~female)+
labs(x = "Monthly income",
y = "Commuting distance (min.)",
title = "Commuting distance and monthly income by gender",
subtitle = " ",
caption = "Source: Student survey 2023 and 2024",
color = "Organ donor")+
theme_bw()
#theme_gray()
#theme_minimal()
#theme_light()
p
If you want some more cool themes, install the ggthemes package.
# Try out different themes from the ggthemes package.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_grid(sidejob~female)+
labs(x = "Monthly income",
y = "Commuting distance (min.)",
title = "Commuting distance and monthly income by gender",
subtitle = " ",
caption = "Source: Student survey 2023 and 2024",
color = "Organ donor")+
#theme_economist()
theme_fivethirtyeight()
p
The theme() function is very important. It enables you to
customize all text elements (sizes and fonts of axis labels, axis ticks,
legend title, legend labels, etc.) and the position of legends. An
extensive example follows below. Please play around with the code and the values
to see the changes in the plot.
# The theme() function: Customizes various visual aspects:
# - axis.text: Sets axis text color (#1f78b4) and size (10).
# - axis.title.y/x: Sets axis title sizes (y: 20, x: 14).
# - legend.position = "bottom": Places the legend at the bottom of the plot.
# - legend.box.background: Adds a background box around the legend.
# - legend.text: Sets legend text size (12) and color ("black").
# - legend.title: Makes legend title bold and sets size (12).
# - strip.text.x/y: Customizes facet strip text (size 12, black color, bold).
# - strip.background: Sets background of facet strips to a solid light blue
# (#a6cee3) with a black border.
# Play around with the settings in theme() to see what it does.
p <- data %>%
ggplot(aes(x = einkommen, y = wegdauer))+
geom_point(aes(color = organ_donor))+
geom_smooth(method = "lm")+
facet_grid(sidejob~female)+
labs(x = "Monthly income",
y = "Commuting distance (min.)",
title = "Commuting distance and monthly income by gender",
subtitle = " ",
caption = "Source: Student survey 2023 and 2024",
color = "Organ donor")+
scale_colour_manual(values = cols)+
theme_minimal()+
theme(axis.text = element_text(colour = "#1f78b4", size = 10),
axis.title.y = element_text(size = 20),
axis.title.x = element_text(size = 14),
legend.position="bottom",
legend.box.background = element_rect(),
legend.text = element_text(size = 12, colour = "black"),
legend.title = element_text(face = "bold", size = 12),
strip.text.x = element_text(size = 12, color = "black", face = "bold"),
strip.text.y = element_text(size = 12, color = "black", face = "bold"),
strip.background = element_rect(color="black", fill="#a6cee3", linetype="solid")
)
p
Task:
# Creates data frame
bad_df <- data %>%
filter(!is.na(statistiknote_w1), !is.na(erstfach_name)) %>%
group_by(erstfach_name) %>%
summarise(mean_grade = mean(statistiknote_w1), n = dplyr::n(), .groups = "drop")
# Your task: Work on this plot!
ggplot(bad_df, aes(x = erstfach_name, y = mean_grade, fill = erstfach_name)) +
geom_col(color = "black") +
coord_cartesian(ylim = c(2.1, 3.2)) +
scale_fill_manual(values = c(
"BWL" = "#EF4444",
"Politik und Wirtschaft" = "#60A5FA",
"VWL" = "#3B82F6",
"Wirtschaftsinformatik" = "#93C5FD")
) +
labs(
title = "Average Exam Grade by Major!",
subtitle = "statistiknote_w1",
x = "MAJOR (FIRST FIELD!)",
y = "EXAM GRADE (?)",
caption = " "
) +
theme_gray(base_size = 16) +
theme(
axis.text.x = element_text(angle = 0),
legend.position = "right",
panel.grid.minor = element_line(color = "grey60")
) +
annotate("text", x = 3, y = 3.15, label = "WOW!", color = "red", size = 7)
There are many other types of plots that might be useful, depending on the application. We will briefly discuss a few more of them below.
Waffle plots are useful for visualizing proportional data in a nice way. They display data as a grid of squares (or other shapes), where each square represents a fixed amount or percentage of the whole. This makes it easy to see the composition of categories within a dataset at a glance.
# Summarize the data
data2 <- data %>%
group_by(male, organ_donor, erstfach_name) %>% # Groups the data
summarize(Freq = n()) %>% # Calculates the frequency of each combination of the grouped variables
mutate(erstfach_name = as.factor(erstfach_name)) %>%
arrange(erstfach_name) %>%
ungroup() # Removes the grouping structure
# Prints the summarized data
print(data2)
## # A tibble: 25 × 4
## male organ_donor erstfach_name Freq
## <int> <fct> <fct> <int>
## 1 0 0 BWL 14
## 2 0 1 BWL 31
## 3 0 <NA> BWL 2
## 4 1 0 BWL 12
## 5 1 1 BWL 37
## 6 1 <NA> BWL 2
## 7 0 0 Politik und Wirtschaft 10
## 8 0 1 Politik und Wirtschaft 10
## 9 0 <NA> Politik und Wirtschaft 2
## 10 1 0 Politik und Wirtschaft 8
## # ℹ 15 more rows
Here comes the waffle plot.
# Create waffle plot
data2 %>%
ggplot(aes(values=Freq, fill=erstfach_name))+
geom_waffle(
n_rows = 20, # Specifies the number of squares in each row.
color = "white", # Specifies border color.
flip = TRUE, # Defines orientation
na.rm=TRUE # Removes missing values
)+
facet_grid(~male)+ # Divides the plot by gender (remove to see the change)
coord_equal()+ # Ensures that each square has equal height and width
theme_minimal()+
theme(axis.title.y = element_blank(), # Removes title, axis titles and texts,
axis.text.y = element_blank(), # and the legend title
axis.text.x = element_blank(),
axis.ticks = element_blank(),
legend.title = element_blank())
Treemaps are useful for visualizing hierarchical data and showing proportions within a whole. They represent data as nested rectangles, where the size of each rectangle corresponds to a specific value or proportion, making it easy to compare the relative sizes of categories or subcategories.
# Create a frequency table for erstfach
table <- data %>%
group_by(erstfach_name) %>%
filter(!is.na(erstfach_name)) %>%
summarize(sum_courses = n()) %>%
# Add a column with names we want to use later
mutate(fach = c("BWL\n(n = 96)",
"Politik und\nWirtschaft\n(n = 52)",
"VWL\n(n = 25)",
"Wirtschaftsinformatik\n(n = 10)"))
table
Here comes the treemap.
# geom_treemap(): Adds the treemap geometry to visualize the data.
# geom_treemap_text(...): Adds text labels inside the rectangles.
# colour = "white": Sets the text color to white.
# place = "centre": Centers the text within each rectangle.
# size = 10: Sets the text size to 10.
# theme(legend.position = "none"): Removes the legend from the plot.
table %>%
ggplot(aes(area = sum_courses, fill = erstfach_name, label = fach) )+
geom_treemap() +
geom_treemap_text(colour = "white",
place = "centre",
size = 10) +
theme(legend.position = "none")
Donut charts are an alternative to pie charts: they have a hole in the middle, which makes them cleaner to read. There are ways to do this without an additional package, and there are other packages, but this is what we found to be the easiest.
Here comes the donut plot.
PieChart(partei, # Defines grouping variable
data = data, # Defines data
hole = 0.2, # Defines the size of the donut hole
main = NULL # Removes title
)
## >>> suggestions
## PieChart(partei, hole=0) # traditional pie chart
## PieChart(partei, labels="%") # display %'s on the chart
## PieChart(partei) # bar chart
## Plot(partei) # bubble plot
## Plot(partei, labels="count") # lollipop plot
##
## --- partei ---
##
## CDU/CSU FDP Grüne Keine Linke Sonstige SPD Total
## Frequencies: 22 25 40 26 10 44 24 191
## Proportions: 0.115 0.131 0.209 0.136 0.052 0.230 0.126 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 28.785, df = 6, p-value = 0.000
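For comparison, a donut chart can also be built with ggplot2 alone, as mentioned above. The sketch below is one possible way, not the method used in this session: it draws a stacked bar and bends it into a ring with polar coordinates. The frequencies are copied from the output above; the x positions and limits are arbitrary choices that create the hole.

```r
library(ggplot2)

# Party frequencies taken from the PieChart() output above.
partei_counts <- data.frame(
  partei = c("CDU/CSU", "FDP", "Grüne", "Keine", "Linke", "Sonstige", "SPD"),
  n      = c(22, 25, 40, 26, 10, 44, 24)
)

p_donut <- ggplot(partei_counts, aes(x = 2, y = n, fill = partei)) +
  geom_col(width = 1, colour = "white") +  # one stacked bar covering x = 1.5 to 2.5
  coord_polar(theta = "y") +               # bends the stacked bar into a ring
  xlim(0.5, 2.5) +                         # the empty range 0.5 to 1.5 becomes the hole
  theme_void() +                           # removes axes and background
  labs(fill = NULL)

p_donut
```

Widening the empty range in `xlim()` enlarges the hole; setting the lower limit to 1.5 would turn the ring back into a regular pie chart.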
The ggstats package provides new statistics, new
geometries and new positions for ggplot2 and a suite of functions to
facilitate the creation of statistical plots.
# Selects the variables used below (statistiknote_w1, abiturnote,
# uebungen_w1 and wegdauer) and drops all rows with missing values.
reg_data <- data %>%
select(statistiknote_w1, abiturnote,
uebungen_w1, wegdauer) %>%
drop_na()
# Runs a linear regression with the expected statistics grade as the dependent
# variable and the abiturnote and the expected number of attended tutorials
# as independent variables.
mod1 <- lm(statistiknote_w1 ~ abiturnote + uebungen_w1, data = reg_data)
summary(mod1) # Standard summary of the regression output.
##
## Call:
## lm(formula = statistiknote_w1 ~ abiturnote + uebungen_w1, data = reg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14485 -0.51276 0.08543 0.54084 1.87771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.08651 0.40581 5.142 0.000000683 ***
## abiturnote 0.26459 0.09361 2.826 0.00522 **
## uebungen_w1 0.01959 0.02495 0.785 0.43330
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.778 on 187 degrees of freedom
## Multiple R-squared: 0.0411, Adjusted R-squared: 0.03084
## F-statistic: 4.007 on 2 and 187 DF, p-value: 0.01977
## `height` was translated to `width`.
Sometimes it is nice to visualize correlation matrices. Here we run a small toy example, but this is especially helpful if there are many continuous variables in the data set.
We use the data set from the previous example.
## statistiknote_w1 abiturnote uebungen_w1 wegdauer
## statistiknote_w1 1.00000000 0.1947652 0.01141229 -0.1735601
## abiturnote 0.19476522 1.0000000 -0.22285379 0.2245036
## uebungen_w1 0.01141229 -0.2228538 1.00000000 -0.3903103
## wegdauer -0.17356015 0.2245036 -0.39031028 1.0000000
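A correlation matrix like the one above can be computed with base R's cor() and then passed to a plotting function. The sketch below assumes a hypothetical data frame `toy_reg` with made-up values and the same column names as reg_data; the ggcorrplot() call is guarded because it assumes the ggcorrplot package is installed.

```r
# Hypothetical stand-in for reg_data; the values are made up.
toy_reg <- data.frame(
  statistiknote_w1 = c(2.0, 2.7, 3.3, 1.7, 3.0, 2.3),
  abiturnote       = c(1.8, 2.5, 3.0, 1.5, 2.9, 2.0),
  uebungen_w1      = c(10, 6, 4, 12, 5, 8),
  wegdauer         = c(15, 40, 60, 10, 45, 25)
)

# Pairwise Pearson correlations between all columns.
corr <- cor(toy_reg)
print(round(corr, 2))

# Plot the matrix if the ggcorrplot package is available (assumption).
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  p_corr <- ggcorrplot::ggcorrplot(corr, lab = TRUE)  # lab = TRUE prints the coefficients
  print(p_corr)
}
```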
Sometimes, we want to combine multiple plots into one larger plot. We use the previous examples, start by giving all the plots a name and then combine them into one large plot.
# Recreating the previous plots.
# Plot 1
p1 <- data2 %>%
ggplot(aes(values=Freq, fill=erstfach_name))+
geom_waffle(
n_rows = 10, # Specifies the number of squares in each row.
color = "white", # Specifies border color.
flip = TRUE, # Defines orientation
na.rm=TRUE # Removes missing values
)+
facet_grid(~male)+ # Divides the plot by gender (remove to see the change)
coord_equal()+
theme_minimal()+
theme(axis.title.y = element_blank(), # Removes title, axis titles and texts,
axis.text.y = element_blank(), # and the legend title
axis.text.x = element_blank(),
axis.ticks = element_blank(),
legend.title = element_blank())
# Plot 2
p2 <- ggcoef_model(mod1)
# Plot 3
p3 <- ggcorrplot(corr)
Now let's arrange them together.
## `height` was translated to `width`.
FYI: If you want to combine multiple plots with the same legend, you can also specify that you want to only display one of them.
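Arranging plots and sharing one legend could be sketched with ggarrange() from the ggpubr package. This is an assumption about how the arrangement was done, since the original call is not shown; the data frame `toy`, the two small plots and the layout are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; the values are made up.
toy <- data.frame(
  einkommen = c(450, 800, 1200, 600, 950, 1500),
  wegdauer  = c(15, 30, 45, 20, 60, 35),
  female    = factor(c(1, 0, 1, 0, 1, 0))
)

# Two plots that use the same colour legend.
p_a <- ggplot(toy, aes(x = einkommen, y = wegdauer, colour = female)) +
  geom_point()
p_b <- ggplot(toy, aes(x = female, y = einkommen, colour = female)) +
  geom_boxplot()

# Combine them if the ggpubr package is available (assumption).
if (requireNamespace("ggpubr", quietly = TRUE)) {
  combined <- ggpubr::ggarrange(
    p_a, p_b,
    ncol = 2, nrow = 1,        # layout: two plots side by side
    labels = c("A", "B"),      # panel labels
    common.legend = TRUE,      # display only one shared legend
    legend = "bottom"          # position of the shared legend
  )
  print(combined)
}
```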
There are SO many more possibilities to create graphs. Just find them online and try them out! You can create animated graphs (here is a good example of an animated economic graph), interactive graphs, 3D graphs and many more.
Our last section will be a very quick checklist on what to avoid when plotting data.
Good visualizations simplify messages and make the main data of interest as easy to understand as possible. Try to avoid wild colors, strange scales, or unnecessary information.
Do you want to highlight a contrast or show a continuous evolution? Be aware of the associations that colors carry (blue – cold, red – warm, etc.).
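The difference between the two color purposes can be sketched as follows: a qualitative palette for contrasting groups and a sequential palette for a continuous variable. The data frame `toy` and the palette choices are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data; the values are made up.
toy <- data.frame(
  einkommen    = c(450, 800, 1200, 600, 950, 1500),
  wegdauer     = c(15, 30, 45, 20, 60, 35),
  female       = factor(c(1, 0, 1, 0, 1, 0)),
  fachsemester = c(1, 3, 5, 2, 7, 4)
)

# Contrast: clearly distinct colors for a categorical variable.
p_contrast <- ggplot(toy, aes(x = einkommen, y = wegdauer, colour = female)) +
  geom_point(size = 3) +
  scale_colour_brewer(palette = "Set1")

# Continuous evolution: a gradient for a numeric variable.
p_gradient <- ggplot(toy, aes(x = einkommen, y = wegdauer, colour = fachsemester)) +
  geom_point(size = 3) +
  scale_colour_viridis_c()

p_contrast
p_gradient
```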
Maybe show the graph to someone not too familiar with the data to find out if the graph is self-explanatory.
This should be a no-brainer, but it is still worth mentioning.
ggplot2 · UC Business Analytics R
Programming Guide. (Boehmke, B.). https://uc-r.github.io/ggplot_intro.